Concepts of Basic Quantitative Data Analysis
2025-12-05
Training curriculum
Training Outline
5 modules · 12 topics
01
What is Statistics?
- Definition & scope : role of statistics in research and decision-making
- Types of data : categorical, continuous, ordinal, discrete variables
02
Branches of Statistics
- Descriptive statistics : summarizing and visualizing data distributions
- Inferential statistics : drawing conclusions from samples to populations
03
Hypothesis Testing & Statistical Errors
- T-Tests & ANOVA : comparing means across two or more groups
- Confidence intervals vs P-values : interpreting uncertainty & significance
- Type I & Type II errors : false positives, false negatives, and power
04
Measures of Association
- Chi-Square Test : independence between categorical variables
- Pearson's Correlation : strength and direction of linear relationships
05
Introduction to Statistical Models (Advanced)
- Regression analysis : linear & logistic regression, model interpretation
- Survival data analysis : Kaplan-Meier curves, Cox proportional hazards
- Longitudinal data analysis : mixed models for repeated measures
Goal of the training
The goal of this training is to:
- Provide an overview of fundamental statistical methods for quantitative data analysis, including the key theoretical results behind them.
- Equip staff with essential knowledge and skills to interpret data effectively.
- Apply statistical methods that support organizational goals.
- Introduce Advanced Topics in Quantitative Data Analysis:
- Regression analysis, Survival Analysis, Longitudinal data analysis
- Enable trainees to pose scientific questions within the context of appropriate statistical methods and carry out and interpret analyses effectively.
What is statistics and quantitative data analysis?
- Statistics is the science of:
- Collecting, analyzing, summarizing, and interpreting data
- Drawing conclusions to support effective decision-making.
- Transforms raw data into meaningful information.
- Helps us learn about the world through data-driven insights.
- Widely applied in fields like:
- Biology, sociology, economics, public health, and business
- When applied to biological or health-related data, it is referred to as Biostatistics
Branches of Statistics
- The field of statistics consists of two major branches that help us describe data and make informed decisions based on evidence.
Purpose of Descriptive Statistics
- Describe the research sample
- Understand background characteristics of study participants
- Ensures proper interpretation of findings
- Understand the data
- Summarizes main study variables before inferential analysis
- Supports answering research questions in descriptive studies
- Aids in data cleaning and outlier detection
- Check assumptions for inferential statistics
- Example: Histograms assess normality in univariate/multivariable analysis
- Present descriptive results appropriately
- Choice of presentation method depends on variable types
- Understanding variable types is essential for accurate summary
Types of variables
What is a Variable?
A variable is any characteristic, number, or quantity that can be measured or counted and differs across individuals or observations.
- Two Main Types of Variables
a). Categorical Variables (Qualitative)
- Describe groups or categories.
- Examples: Sex, Region, Marital Status, Education Level.
b). Numerical Variables (Quantitative)
- Represent numbers and quantities.
- Examples: Age, Income, Height, Number of Children.
MEASURE OF CENTRAL TENDENCY
Measure of central tendency
- Refers to a statistical measure that identifies a single value representing the entire distribution.
- Aims to provide an accurate summary of the overall data.
- Represents the most typical or central value in a dataset.
- Known as the “number crunching” part of data description.
- Three common measures:
- Mean – the average of all values
- Median – the middle value when data is ordered
- Mode – the most frequently occurring value
Covers statistical methods for describing data using statistical characteristics, charts, graphics or tables.
Mean
- Also known as the “average” in everyday language.
- Applicable only to numeric variables (interval or ratio scale).
- Represents the center of gravity of a distribution.
- Sum of all values ÷ Number of values (n)
- Provides a balanced summary of the dataset.
Mode (Modal Value)
- The mode is the most frequent value in a distribution.
- The mode can be used for both continuous and categorical (nominal or ordinal) variables.
- Can be non-existent, unique (unimodal), or multiple (bimodal, etc.).
- Applicable to categorical data.
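As a minimal sketch of the three measures of central tendency, the snippet below uses Python's standard-library `statistics` module. The list `ages` is made-up illustrative data, not taken from the training material.

```python
import statistics

# Hypothetical ages of nine study participants (illustrative data only)
ages = [23, 25, 25, 27, 30, 31, 25, 40, 62]

mean_age = statistics.mean(ages)      # sum of all values / number of values
median_age = statistics.median(ages)  # middle value of the sorted data
modes = statistics.multimode(ages)    # list of the most frequent value(s)

print(f"mean={mean_age:.1f}, median={median_age}, mode(s)={modes}")
```

In this example the single large value (62) pulls the mean above the median, which is why the median is often preferred for skewed data.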
Measure of Dispersion
- Dispersion (variability) describes how far the values in a distribution are spread out from one another.
- The most widely used measures of dispersion are:
- Variance and standard deviation
- Range and interquartile range (IQR)
- Variance
- Variance measures the spread of the observations around the mean.
- The sample variance (s²) is the sum of the squared deviations from the sample mean divided by n − 1, i.e., approximately the average squared deviation.
\[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2\]
Standard deviation
- The standard deviation is a statistical metric that quantifies the dispersion or variability of data points relative to their mean.
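- It reflects how far, on average, each individual value deviates from the mean, offering insight into the spread of the data.
- A low standard deviation means that the values are close to the mean; a high standard deviation means they are spread over a wider range.
- The population (σ) and sample (s) SD formulas:
\[\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2} \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}\]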
Example of variance and standard deviation calculation
\[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2\]
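The calculation can be traced step by step with a short sketch. The data in `values` are invented for illustration; the code simply applies the sample variance formula above and takes its square root to obtain the sample standard deviation.

```python
import math

values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data, not from the course
n = len(values)
mean = sum(values) / n

# Squared deviation of each observation from the sample mean
squared_deviations = [(x - mean) ** 2 for x in values]

sample_variance = sum(squared_deviations) / (n - 1)  # s^2, divides by n - 1
sample_sd = math.sqrt(sample_variance)               # s

print(f"mean={mean}, s^2={sample_variance:.3f}, s={sample_sd:.3f}")
```

Here the mean is 5, the squared deviations sum to 32, and dividing by n − 1 = 7 gives s² ≈ 4.571 and s ≈ 2.138.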
Interquartile range (IQR)
- IQR is the distance between the 1st and 3rd quartile.
- It is not sensitive to extreme values (outliers), so it is usually reported together with the median when the distribution is skewed.
\[IQR = q_3 - q_1, \quad \text{where } q_1 \text{ and } q_3 \text{ are the 1st and 3rd quartiles}\]
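The robustness of the IQR to outliers can be illustrated with a small sketch. It assumes the "inclusive" quartile convention of Python's `statistics.quantiles`; other software may use slightly different quartile definitions, so results can differ a little. The helper name `iqr` and both data lists are hypothetical.

```python
import statistics

def iqr(data):
    """Interquartile range q3 - q1 using the 'inclusive' quartile method."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7]
with_outlier = [1, 2, 3, 4, 5, 6, 100]  # last value replaced by an outlier

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    data_range = max(data) - min(data)
    print(f"{name}: range={data_range}, IQR={iqr(data)}")
```

The range jumps from 6 to 99 when the outlier is introduced, while the IQR stays at 3.0.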
ORGANIZATION AND PRESENTATION OF DATA (TABLES & GRAPHS)
Tabular Presentation
A frequency table is a standard form of tabular presentation and is effective for presenting large amounts of data. Tables should have clear titles, row and column labels, and units of measurement.
Charts
- In charts, data is represented graphically.
- They are used in statistics mainly to get an overview of the collected data and to present information in an easily understandable way, i.e., to display data patterns, relationships, and trends.
- The most commonly used charts in statistics are bar charts, histograms, scatter plots, line plots, box plots, and pie charts.
Bar chart
- Bar charts are one of the most common charts in scientific work and they are mostly used to show either frequencies or averages.
- Depending on how the bars are arranged, a distinction is made between horizontal and vertical bar charts.
Bar chart (continued)
- Due to the simplicity of bar charts, they are often created in descriptive statistics.
- They provide a very quick overview of the ranking and frequencies of characteristic values.
- A bar chart shows absolute and relative frequencies on a two-axis coordinate system.
Grouped bar charts
- If two categorical variables are present, grouped bar charts can be created.
- In grouped bar charts, either the frequency, the percent, or the percent in each group can be specified.
Overall, bar chart is
Histogram
- A histogram is a graphical representation of the frequency distribution of a numeric variable.
- To display a distribution of data in a histogram, the data must first be divided into classes, also called bins.
- These classes or bins are then represented by rectangles that lie directly next to each other.
- Histograms are commonly used for a visual check of normality.
Frequency polygon
- A frequency polygon is a graph that displays the data using lines to connect points plotted for the frequencies at the midpoints of the classes.
Box plot
- Boxplots provide a compact visual summary of data distributions.
- Key features displayed include:
- Median (central value), Interquartile Range (IQR) (middle 50% of the data), Outliers (unusual values outside the typical range)
- Ideal for continuous data (e.g., age, income, temperature).
- Commonly used to compare multiple groups, e.g., comparing age distributions across different population groups.
How is a boxplot interpreted?
- The box itself indicates the range in which the middle 50% of all values lie.
- Thus, the lower end of the box is the 1st quartile, and the upper end is the 3rd quartile.
Boxplot illustration
Scatter Plots
- Used to visualize correlations between two variables.
- Each data point represents a pair of values in a coordinate system.
- Example: Plotting height vs. weight for individuals.
What can you say about this graph?
Scatter Plots (continued)
- Helps identify the type of correlation:
- Positive correlation: Both variables increase together.
- Negative correlation: One variable increases while the other decreases.
- No correlation: Points appear randomly scattered.
Can also show non-linear relationships where data follows a pattern but not a straight line.
Good practice in data presentation: tabular result
- When presenting descriptive statistics in tables, it is good practice to keep the table simple, organized, and easy to read.
- Use clear headings for each variable, and label columns properly (e.g., mean, median, SD).
- Align numbers neatly, usually to the right, and round them to a consistent number of decimal places.
- Always add a clear table title and footnotes if any explanations or abbreviations are used, so the reader can understand the table without extra help.
Tabular presentation or reporting
Graphical data presentation or reporting
- Choose simple, clear, and appropriate graphs that match the type of data.
- For categorical data, use bar charts or pie charts to show frequencies or proportions.
- For quantitative data, use histograms for distribution, box plots for medians and spread, and scatter plots to show relationships between two variables.
Always label the axes clearly, include units where needed, and use a title that explains what the graph shows.
Inferential Statistics
- Involves drawing conclusions about a population from a random sample: an educated guess about the population rather than a description of the data at hand.
- Useful when it is impractical to study the entire population.
- Allows generalization from sample data to a larger group.
- Example:
- Evaluating the effect of a cash transfer program by surveying a sample of recipients and inferring the results to the broader population.
- Key methods (the choice depends on data type, assumptions, and objective):
- Group comparison tests:
- t-test, chi-square test, ANOVA (Analysis of Variance)
- Relationship/correlation tests:
- Correlation analysis, regression analysis
Random Sample and Sampling Error
- A random and representative sample approximates the population, but is rarely identical to it.
- The difference between a sample statistic and the true population value is called sampling error.
- Sampling error is:
- Natural and expected
- Unavoidable when using samples instead of full populations
Random Sample and Sampling Error (continued)
- Since population values are unknown, the exact sampling error is also unknown.
- We use statistical methods to estimate sampling error and evaluate reliability.
Role of Hypothesis Testing:
- Helps determine whether sample results reflect true population effects or are due to random chance.
- Accounts for expected variability caused by sampling error.
Hypothesis testing
- A hypothesis is an assumption about a relationship or effect that is neither proven nor disproven at the start of a study.
- It is developed based on a research question and typically justified through a literature review.
- A hypothesis proposes an expected association (e.g., “Men earn more than women in the same job in Ethiopia”).
- The goal is to test the hypothesis against data: the analysis either supports it or fails to support it, but never proves it with certainty.
- Data from surveys or experiments are used, and a hypothesis test (e.g., t-test, correlation analysis) is applied.
Null and alternative hypothesis
- In hypothesis testing, we always define two opposing hypotheses:
- Null hypothesis (H₀): Assumes no difference or no effect between groups.
- Example: The salaries of men and women in Ethiopia do not differ.
- Alternative hypothesis (H₁): Assumes a difference or an effect between groups.
- Example: The salaries of men and women in Ethiopia differ.
- The hypothesis you want to prove (based on theory or research) is usually the alternative hypothesis.
- In a hypothesis test, you only test the null hypothesis and decide whether to reject it.
Level of significance or probability of error
- In hypothesis testing, we can never be 100% sure when rejecting the null hypothesis; there is always a small chance we are wrong.
- The significance level (α) is the allowed probability of wrongly rejecting a true null hypothesis.
- If the p-value is smaller than α, we reject the null hypothesis.
- If the p-value is greater than α, we do not reject the null hypothesis.
- The most common significance level is 5% (α = 0.05), i.e., a 5% risk of wrongly rejecting a true null hypothesis.
Example: Two-Sample t-Test
- Used to compare the means of two independent groups.
- A larger difference between sample means suggests it’s less likely both groups come from the same population.
Key Concepts:
- If p-value < significance level (α) → Reject H₀ (null hypothesis)
- Example: p = 0.04 < 0.05 → reject H₀. If there were no true difference, a mean difference at least as extreme as the one observed would be expected in only about 4% of samples.
- The significance level (α):
- Must be set before analysis (commonly 0.05)
- Must not be adjusted afterward to influence results
Types of Errors in Hypothesis Testing
Hypothesis testing is based on sample data, so errors can occur due to random variation. No test is 100% foolproof - sample results naturally vary by chance.
Two Main Types of Errors:
- Type I Error (α):
- Rejecting the null hypothesis when it is actually true
- False positive → concluding there is an effect when there isn’t
- Type II Error (β):
- Failing to reject the null hypothesis when the alternative is true
- False negative → missing a real effect that exists
Type I Error (\(\alpha\)) - False Positive
- Occurs when the null hypothesis is rejected even though it is actually true.
- Also called a false positive → concluding there is an effect when there isn’t one.
- The probability of committing a Type I error is the significance level \(\alpha\), set by the researcher (commonly \(\alpha = 0.05\)).
\[P(\text{Type I Error}) = \alpha\]
Example: A drug trial concludes the drug is effective, but in reality it has no effect. The result was due to random chance in the sample.
Type II Error (\(\beta\)) - False Negative
- Occurs when the null hypothesis is not rejected even though the alternative hypothesis is true.
- Also called a false negative → missing a real effect that actually exists.
- The probability of committing a Type II error is denoted \(\beta\).
\[P(\text{Type II Error}) = \beta\]
Example: A drug trial concludes the drug has no effect, but it actually does work. The study failed to detect the real effect.
Statistical Power
The power of a test is the probability of correctly rejecting \(H_0\) when it is false (the ability to detect a true effect).
\[\text{Power} = 1 - \beta\]
The Trade-off Between Type I and Type II Errors
- Decreasing \(\alpha\) (stricter threshold) → reduces Type I errors, but increases Type II errors (\(\beta\)).
- Increasing \(\alpha\) (lenient threshold) → reduces Type II errors, but increases Type I errors.
- For a given effect size and variability, the main way to reduce both simultaneously is to increase the sample size \(n\).
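The trade-off can be made concrete with a small sketch. It deliberately uses a simplified setting that is an assumption of this example, not part of the course material: a one-sided z-test with known standard deviation. The function name `power_one_sided_z` and all numbers are hypothetical.

```python
from statistics import NormalDist

def power_one_sided_z(effect, sigma, n, alpha):
    """Power (1 - beta) of a one-sided z-test with known sigma.

    effect: true mean minus the null-hypothesis mean (assumed > 0).
    """
    std_normal = NormalDist()                # mean 0, standard deviation 1
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold under H0
    shift = effect * n ** 0.5 / sigma        # standardised true effect
    return 1 - std_normal.cdf(z_crit - shift)

power_alpha_05 = power_one_sided_z(effect=0.5, sigma=1.0, n=25, alpha=0.05)
power_alpha_01 = power_one_sided_z(effect=0.5, sigma=1.0, n=25, alpha=0.01)
power_larger_n = power_one_sided_z(effect=0.5, sigma=1.0, n=50, alpha=0.01)

print(f"alpha=0.05, n=25: power={power_alpha_05:.3f}")
print(f"alpha=0.01, n=25: power={power_alpha_01:.3f}")
print(f"alpha=0.01, n=50: power={power_larger_n:.3f}")
```

Lowering α from 0.05 to 0.01 at the same sample size lowers the power (raises β), while doubling the sample size recovers power without loosening α.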
Why Do Errors Happen?
- Sample results vary by chance: no sample perfectly represents the population.
- No test is 100% foolproof: a decision based on probabilities will occasionally be wrong.
- The goal is not to eliminate errors entirely, but to control and minimise them through
- appropriate study design,
- sample size, and
- significance thresholds.
Choosing the Right Statistical Test
Selecting the appropriate test depends on three key factors:
- Type of variables - categorical (nominal/ordinal) or numeric (continuous/discrete)
- Number of groups or samples - one, two, or more groups
- Relationship between samples - independent or related (paired)
Common Statistical Tests
| Test | Purpose | Variable types |
| --- | --- | --- |
| T-test | Compares mean differences | Numeric (DV), Categorical (IV) |
| Chi-square | Tests association between categorical variables | Categorical |
| One-way ANOVA | Compares means across 3+ independent groups | Numeric (DV), Categorical (IV) |
| Two-way ANOVA | Examines two factors and their interaction effects | Numeric (DV), Two categorical (IV) |
a) T-test
Used to compare mean differences between groups. Three variants:
| Variant | Purpose | Example question |
| --- | --- | --- |
| One-sample t-test | Compare a sample mean to a known population value | Is the average exam score different from 70? |
| Independent two-sample t-test | Compare means of two unrelated groups | Do males and females differ in weight? |
| Paired-sample t-test | Compare means within the same group at two time points | Did weight change after an intervention? |
\(t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \quad \text{(one-sample)}\)
\(t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2_p\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \quad \text{(independent two-sample)}\)
\(t = \frac{\bar{d}}{s_d / \sqrt{n}} \quad \text{(paired)}\)
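The three t statistics above can be computed by hand as in the following sketch. The function names and all data are hypothetical. The pooled variance is taken as \(s^2_p = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}\), which assumes equal variances in both groups. Turning a t value into a p-value requires the t-distribution, which the Python standard library does not provide, so that step is left out here.

```python
from math import sqrt
from statistics import mean, stdev, variance

def one_sample_t(sample, mu0):
    """t = (x_bar - mu0) / (s / sqrt(n))"""
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(len(sample)))

def two_sample_t_pooled(group1, group2):
    """Independent two-sample t with pooled variance (equal variances assumed)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

def paired_t(before, after):
    """t = d_bar / (s_d / sqrt(n)), where d = before - after for each pair."""
    diffs = [b - a for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Illustrative data only
t_one = one_sample_t([72, 75, 68, 71, 74], mu0=70)
t_two = two_sample_t_pooled([1, 2, 3], [2, 4, 6])
t_paired = paired_t(before=[80, 85, 90], after=[78, 82, 85])

print(f"one-sample t={t_one:.3f}, two-sample t={t_two:.3f}, paired t={t_paired:.3f}")
```

The degrees of freedom needed to look up the p-value are n − 1 for the one-sample and paired tests and n₁ + n₂ − 2 for the pooled two-sample test.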
b) Chi-Square Test
Tests the association between two categorical variables.
\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]
- \(O_i\) = observed frequency, \(E_i\) = expected frequency
- Used for nominal or ordinal data - never for continuous variables
- See the Chi-Square Test notes for full breakdown.
c) One-way / Two-way ANOVA
- One-way ANOVA: Compares means across three or more independent groups on one factor.
- Two-way ANOVA: Examines two independent factors simultaneously, including their interaction effect on the outcome.
\[F = \frac{\text{Variance between groups}}{\text{Variance within groups}}\]
If \(F\) is large, the between-group differences are unlikely due to chance alone.
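A minimal sketch of the one-way ANOVA F ratio follows. The "variances" in the formula are computed as mean squares (sum of squares divided by degrees of freedom); the function name `one_way_anova_f` and the three groups are made up for illustration.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # illustrative data only
f_value, df_between, df_within = one_way_anova_f(groups)
print(f"F({df_between}, {df_within}) = {f_value:.2f}")
```

For these data the between-group mean square is 13 and the within-group mean square is 1, giving F(2, 6) = 13.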
Checking Assumptions of Statistical Tests
- Most statistical tests require certain assumptions to be valid.
- Assumptions should always be checked before running tests - violating them can lead to invalid conclusions.
How to Check Assumptions
| Approach | How |
| --- | --- |
| Numerically | Compare mean vs. median; assess skewness and kurtosis |
| Statistically | Use formal normality tests (Shapiro-Wilk, Kolmogorov-Smirnov) |
| Graphically | Use boxplots, histograms, Q-Q plots |
Normality Assumption
Many tests including the t-test and ANOVA assume that the data (or residuals) follow a normal distribution.
Graphical Checks
Histogram
- Visually inspect the shape of the distribution.
- A bell-shaped, symmetric histogram suggests normality.
- Skewed or multi-modal shapes suggest non-normality.
Q-Q Plot (Quantile-Quantile Plot)
- Plots observed quantiles against theoretically expected quantiles under normality.
- If points fall along the diagonal line → normality is supported.
- Systematic deviations from the line → normality is violated.
What to Do When Normality Is Violated
| Situation | Recommended approach |
| --- | --- |
| Mild violation, large \(n\) | Parametric tests are still robust (Central Limit Theorem applies) |
| Small sample, clear skew | Use non-parametric alternatives |
| T-test violated | Use Mann-Whitney U (independent) or Wilcoxon (paired) |
| ANOVA violated | Use Kruskal-Wallis test |
| Chi-square violated | Use Fisher’s Exact Test |
Confidence Interval
- A CI defines a range where the true population parameter (e.g., mean) is likely to lie.
- Sample estimates (mean, variance) are only approximations of the true population values.
- Based on the sample mean (x̄), the sample size (n), and the sample standard deviation (s), and assuming a normal sampling distribution of the estimate, the CI is given as:
\[CI = \bar{x} \pm z \cdot \frac{s}{\sqrt{n}}\]
Confidence Interval (continued)
If the sample is small, the t-distribution is used instead of the normal distribution.
Then the z value is replaced by t and the formula is: \[CI = \bar{x} \pm t \cdot \frac{s}{\sqrt{n}} \]
A 95% confidence interval is built by a procedure that, over repeated samples, captures the true parameter value in about 95% of cases; informally, you can be 95% confident that the true value lies within the interval.
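As a rough sketch, a large-sample confidence interval for a mean can be computed with the standard library as below. It uses the normal (z) critical value; for small samples the t critical value should be used instead, which would need a t-table or an external library. The function name `mean_confidence_interval` and the summary statistics are hypothetical.

```python
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Large-sample CI: mean +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sample_sd / n ** 0.5
    return sample_mean - margin, sample_mean + margin

# Illustrative summary statistics: mean 50, SD 10, n = 100
low, high = mean_confidence_interval(50, 10, 100)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```

With these numbers the standard error is 1, so the interval is roughly 50 ± 1.96.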
Statistical tests for differences
- One-Sample t-Test
- Purpose: Tests if the sample mean differs significantly from a known or hypothesized population mean.
- Used when: You have one sample and a fixed reference value (e.g., target, population mean).
One-Sample t-Test (continued)
- Requirements
- Random sample and approx. normal distribution of the data
- Numeric (continuous) data
Types of Questions
- Two-tailed: Is the sample mean different from the reference value?
- One-tailed: Is the sample mean greater than or less than the reference value?
Hypotheses
- H₀ (Null): μ = μ₀ (The population mean equals the reference value)
- H₁ (Alternative): μ ≠ μ₀ (The population mean differs from the reference value)
Example: for a pre-specified reference mean µ₀ = 28
A t-test showed a statistically reliable difference between the score of students who attended the online course and the average score of students who did not attend an online course. t(11) = 2.75, p < 0.02, α = 0.05.
Two-Sample t-Test (Independent Samples)
- Purpose: Tests if two independent groups differ significantly in their means.
- Used when: Comparing two unrelated groups (e.g., treatment vs placebo, male vs female).
- Requirements
- Two independent samples & Numerical (continuous) data
- Normal distribution in each group (or large enough samples for approximation)
- Homogeneity of variances (can be tested with Levene’s test)
Examples
- Does Drug XY reduce weight compared to a placebo?
- Is there a health difference between people with and without a degree?
Hypotheses
- H₀ (Null): μ₁ = μ₂ (No difference between group means)
- H₁ (Alternative): μ₁ ≠ μ₂ (There is a difference between group means)
What is your conclusion?
Conclusion of the independent t-test
- In this example, the p-value is 0.31, which is higher than the 5% significance level (α = 0.05), so the difference between the two groups is not statistically significant.
- The confidence interval (-6.328 to 18.118) crosses zero, also showing no significant difference.
Report a t-test for independent samples:
- An independent samples t-test was conducted to compare exam results in summer and winter. There was no significant difference in the scores (p = 0.31). The mean difference was 5.9 (95% CI: [-6.33, 18.12]); because the interval includes zero, the data are compatible with no difference.
Paired-Samples t-test (Dependent t-test)
- The paired-samples t-test is used to compare two related groups to determine whether their mean difference is statistically significant.
- Pairs of observations are needed:
- Repeated measurements on the same individuals (before vs after treatment).
- Matched subjects across two groups (e.g., twins, matched cases).
- Controls for individual variability because comparisons are within the same subjects.
- Greater chance to detect a real difference if one exists.
Look at the before and after weight measurements
ANOVA (Analysis of Variance)
- ANOVA is used to test whether statistically significant differences exist between three or more group means.
- Why not use multiple t-tests?
- Every hypothesis test carries a risk of Type I error (commonly set at 5%).
- Running multiple t-tests increases the probability of making at least one false-positive conclusion.
- ANOVA helps to avoid inflated error rates when comparing several groups simultaneously.
- Independent variables (factors): Categorical.
- Dependent variable: Continuous variable
- F-ratio: The test statistic in ANOVA.
- It compares between-group variability to within-group variability.
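The inflation of the Type I error rate from running multiple t-tests can be approximated with a short calculation. The sketch assumes the pairwise tests are independent, which is a simplification (tests that share a group are correlated), so the numbers are only indicative. The function name `familywise_error_rate` is hypothetical.

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Number of pairwise tests and P(at least one false positive),
    assuming independent tests each run at level alpha."""
    n_tests = comb(n_groups, 2)  # every pair of groups compared once
    return n_tests, 1 - (1 - alpha) ** n_tests

for k in (3, 4, 5):
    n_tests, fwer = familywise_error_rate(k)
    print(f"{k} groups -> {n_tests} t-tests, familywise error rate ~ {fwer:.3f}")
```

With three groups (three pairwise tests) the chance of at least one false positive is already about 14%, and with five groups (ten tests) it is about 40%.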
Types of ANOVA
The most commonly used ANOVA types for non-repeated measures are:
One-factor ANOVA
- Does a person’s place of residence (independent variable) influence his or her salary?
Two-factors ANOVA
- Does a person’s place of residence (1st independent variable) and gender (2nd independent variable) affect his or her salary?
Example for One-factor ANOVA
- The independent variable, e.g. “highest educational qualification” with the three categories group 1, group 2 and group 3, should explain as much of the variance of the dependent variable “salary” as possible.
Accordingly, in case A) the groups have a very high influence on the salary and in case B) they do not.
ANOVA hypotheses
- Null hypothesis H0: The mean of all groups is equal.
- Alternative hypothesis H1: There are differences in the means of the groups.
- Example: You want to check whether coffee consumption differs between students of different subjects. To do this, you survey 10 students from each field of study.
Two-Way ANOVA (Two-Factor ANOVA)
- Two-way ANOVA tests whether two independent categorical variables (factors) influence a continuous dependent variable.
- It also tests whether there is an interaction effect between the two factors.
- You want to test the effects of two factors on one dependent variable.
- Example questions:
- Do gender and education affect salary?
- Do therapy type and gender affect blood pressure?
Two-Way ANOVA (continued)
Three Questions Answered by Two-Way ANOVA
- Does Gender affect the dependent variable?
- Does Education affect the dependent variable?
- Is there an interaction effect between Factor 1 and Factor 2?
Hypothesis:
Steps in Two-Way ANOVA Calculation
- Calculate group means (e.g., mean attitude scores for male & studied, male & not studied, etc.).
- Calculate overall mean of all values.
- Calculate Sums of Squares:
- SStotal (total variation from overall mean)
- SSfactorA (variation explained by Factor 1)
- SSfactorB (variation explained by Factor 2)
- SSinteraction (variation explained by interaction of A and B)
- SSerror (unexplained variation)
- Calculate degrees of freedom for each component.
- Calculate mean squares (variance estimates).
- Calculate F-values:
- F = Mean square of the factor / Mean square error
- Interpret the p-values to decide whether to reject or fail to reject each null hypothesis.
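The calculation steps above can be sketched in code for the simplest case. The sketch assumes a complete, balanced design (the same number of observations in every cell, at least two per cell); unbalanced designs need a more general method. The function name `two_way_anova`, the factor labels, and the scores are all hypothetical.

```python
from statistics import mean

def two_way_anova(cells):
    """Sums of squares, degrees of freedom and F-values for a balanced design.

    cells maps (level_of_A, level_of_B) -> list of observations.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))  # observations per cell
    balanced = all(len(v) == n for v in cells.values())
    complete = len(cells) == len(levels_a) * len(levels_b)
    if not (balanced and complete and n >= 2):
        raise ValueError("needs a complete, balanced design with >= 2 values per cell")

    all_values = [x for v in cells.values() for x in v]
    grand = mean(all_values)                                  # overall mean
    cell_mean = {key: mean(v) for key, v in cells.items()}    # group means
    mean_a = {a: mean([x for (ka, _), v in cells.items() if ka == a for x in v])
              for a in levels_a}
    mean_b = {b: mean([x for (_, kb), v in cells.items() if kb == b for x in v])
              for b in levels_b}

    ss = {
        "total": sum((x - grand) ** 2 for x in all_values),
        "A": len(levels_b) * n * sum((m - grand) ** 2 for m in mean_a.values()),
        "B": len(levels_a) * n * sum((m - grand) ** 2 for m in mean_b.values()),
        "error": sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v),
    }
    ss["AB"] = ss["total"] - ss["A"] - ss["B"] - ss["error"]  # interaction

    df = {
        "A": len(levels_a) - 1,
        "B": len(levels_b) - 1,
        "AB": (len(levels_a) - 1) * (len(levels_b) - 1),
        "error": len(levels_a) * len(levels_b) * (n - 1),
    }
    ms_error = ss["error"] / df["error"]
    f_values = {name: (ss[name] / df[name]) / ms_error for name in ("A", "B", "AB")}
    return ss, df, f_values

# Illustrative scores only: factor A = gender, factor B = education status
scores = {
    ("male", "studied"): [1, 3],
    ("male", "not studied"): [3, 5],
    ("female", "studied"): [5, 7],
    ("female", "not studied"): [7, 9],
}
ss, df, f_values = two_way_anova(scores)
print(ss, df, f_values)
```

Each F-value would then be compared with the F-distribution for its degrees of freedom to obtain a p-value, a step that is not covered by this sketch.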
Two-Way ANOVA example
- Example: Does gender (male or female) and education status (studied or not) influence a person’s attitude towards retirement planning?
- Dependent Variable: Attitude towards retirement planning (rated from 1 = not important to 10 = very important).
All p-values > 0.05: there is no significant effect of Study, Gender, or their interaction on attitude towards retirement planning.
Chi-Square Test
The Chi-square test is a statistical hypothesis test used for categorical variables (i.e., nominal or ordinal scales, e.g., types of assistance received, satisfaction levels).
- It is used to determine whether there is a significant association between two or more categorical variables.
The Chi-square test can answer three types of questions:
| Test of Independence |
Are two categorical variables independent of each other? |
| Test of Goodness-of-Fit |
Do observed frequencies match an expected distribution? |
| Test of Homogeneity |
Do two or more groups come from the same population? |
Test of Independence
Used to test whether two categorical variables are independent of each other.
Hypotheses
\(H_0: \text{The two variables are independent (no association)}\)
\(H_1: \text{The two variables are not independent (association exists)}\)
\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]
where the expected frequency for each cell is:
\(E_{ij} = \frac{R_i \times C_j}{N}\)
\(df = (r - 1)(c - 1)\)
where \(r\) = number of rows and \(c\) = number of columns in the contingency table.
| Symbol | Name | Meaning |
| --- | --- | --- |
| \(O_i\) | Observed frequency | Actual count recorded in each cell of the contingency table |
| \(E_i\) | Expected frequency | Count expected under \(H_0\) if the variables were truly independent |
| \((O_i - E_i)^2\) | Squared deviation | Amplifies large discrepancies between observed and expected counts |
| \(E_i\) (denominator) | Standardisation | Scales the deviation relative to the expected count, preventing large cells from dominating |
| \(R_i\) | Row total | Sum of all counts in row \(i\) |
| \(C_j\) | Column total | Sum of all counts in column \(j\) |
| \(N\) | Grand total | Total number of observations |
Decision rule
- If \(\chi^2_{\text{calculated}} > \chi^2_{\text{critical}}\) at significance level \(\alpha\) → Reject \(H_0\)
- If \(\chi^2_{\text{calculated}} \leq \chi^2_{\text{critical}}\) → Fail to reject \(H_0\)
Assumptions
- Observations are independent of each other.
- Each observation falls into exactly one cell.
- Expected frequency in each cell is \(\geq 5\) (if violated, consider Fisher’s Exact Test).
- Data are counts (frequencies), not proportions or percentages.
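The test of independence can be sketched for a small contingency table as follows. The 2×2 table of counts is invented for illustration, and the function name `chi_square_independence` is hypothetical. The critical value 3.841 is the standard chi-square threshold for df = 1 at α = 0.05.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E_ij = R_i * C_j / N
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

observed_counts = [[10, 20],
                   [20, 10]]  # illustrative 2x2 table of counts
chi2, df = chi_square_independence(observed_counts)

CRITICAL_DF1_ALPHA_05 = 3.841  # chi-square critical value for df = 1, alpha = 0.05
reject_h0 = chi2 > CRITICAL_DF1_ALPHA_05
print(f"chi2={chi2:.3f}, df={df}, reject H0: {reject_h0}")
```

Every expected count here is 15, which satisfies the assumption that expected frequencies are at least 5.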
Test of Goodness-of-Fit
Used to determine whether observed frequencies match a theoretically expected distribution.
\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]
\[df = k - 1\]
where \(k\) is the number of categories.
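A goodness-of-fit calculation can be sketched in the same way. The example tests made-up die-roll counts against a uniform (fair die) distribution; the function name `chi_square_goodness_of_fit` and the data are hypothetical.

```python
def chi_square_goodness_of_fit(observed, expected_proportions):
    """Chi-square statistic and df = k - 1 for observed category counts."""
    n = sum(observed)
    expected = [n * p for p in expected_proportions]  # expected counts under H0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1

rolls = [8, 12, 10, 10, 9, 11]  # illustrative counts for faces 1..6 (60 rolls)
fair_die = [1 / 6] * 6          # H0: each face is equally likely

chi2_gof, df_gof = chi_square_goodness_of_fit(rolls, fair_die)
print(f"chi2={chi2_gof:.3f}, df={df_gof}")
```

The statistic would then be compared with the chi-square critical value for k − 1 = 5 degrees of freedom at the chosen significance level.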
Test of Homogeneity
Used to determine whether two or more independent groups share the same distribution of a categorical variable.
The Formula is the same as the Test of Independence:
\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]
\(df = (r - 1)(c - 1)\)
Note: The key distinction from the Test of Independence is the study design - in homogeneity testing, group membership (rows) is fixed by the researcher, whereas in independence testing, both variables are observed freely.
Comparison of the three Chi-Square tests
| Feature | Test of Independence | Test of Goodness-of-Fit | Test of Homogeneity |
| --- | --- | --- | --- |
| Variables | 2 categorical | 1 categorical | 1 categorical across groups |
| Data structure | Contingency table | Single frequency table | Contingency table |
| Sampling | One sample | One sample | Multiple independent samples |
| Question | Association? | Matches distribution? | Same distribution across groups? |
| df | \((r-1)(c-1)\) | \(k - 1\) | \((r-1)(c-1)\) |
Example for Test of Independence
- To test whether two categorical variables, Gender and Education level, are independent.
Test of Independence (continued)
The chi-square value is calculated via:
\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]
Research Question: Is umbrella use dependent on gender?
The calculated chi-square value is smaller than the critical value of 3.841 (df = 1, α = 0.05), so H₀ is not rejected: men and women do not differ significantly in umbrella use.
Statistical Methods for Testing Correlations
- What is Correlation?
- Correlation analysis is a statistical method used to examine the relationship between two continuous or ordinal variables.
- It measures:
- The direction of the relationship (positive or negative)
- The strength of the relationship (weak or strong)
- The measure of this relationship is the correlation coefficient, which ranges from -1 to +1.
Correlation vs Causation
- Correlation shows association, not causation.
- A strong correlation does not prove that changes in one variable cause changes in the other.
- Example: Childhood speech and school success:
- There may be a correlation, but correlation alone doesn’t prove that speaking earlier causes better school success.
| Type | Description | Example |
| --- | --- | --- |
| Positive (+) | As one variable increases, the other also increases. | Height and shoe size |
| Negative (-) | As one variable increases, the other decreases. | Price and sales volume |
| No Correlation (0) | No linear relationship between variables. | Random variables |
Pearson Correlation Analysis
- The Pearson correlation coefficient measures the linear relationship between two continuous (interval/ratio) variables.
Step 1: Covariance
- Covariance measures how two variables change together.
- Positive covariance → Positive relationship
- Negative covariance → Negative relationship
- Covariance formula:
\[Cov(x,y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}\]
Pearson Correlation coefficient
Step 2: Correlation
- Covariance is not standardized, making it hard to compare across different datasets.
- So, we normalize it to get the correlation coefficient:
\[r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}}\]
- This is the Pearson correlation coefficient, which takes values between -1 and +1.
- Before calculating the Pearson correlation, it’s important to visually check the relationship:
- Scatterplots help detect if the relationship is linear or non-linear.
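The two steps above can be sketched in a few lines of plain Python. The height and shoe-size values are hypothetical and serve only to illustrate the formulas; `covariance` and `pearson_r` are illustrative helper names, not functions from a particular library.

```python
import math

def covariance(x, y):
    """Sample covariance with the n - 1 denominator (Step 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Pearson r: covariance divided by the product of the standard deviations (Step 2)."""
    sd_x = math.sqrt(covariance(x, x))
    sd_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (sd_x * sd_y)

# Hypothetical data: height (cm) and shoe size for five people.
height = [160, 165, 170, 175, 180]
shoe_size = [37, 38, 40, 42, 43]

cov_hs = covariance(height, shoe_size)
r_hs = pearson_r(height, shoe_size)
print("covariance:", round(cov_hs, 3))
print("Pearson r:", round(r_hs, 3))
```

The covariance depends on the measurement units (cm times shoe sizes), whereas r is unit-free, which is why r can be compared across datasets.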
Pearson correlation: linearity
- Pearson correlation only captures linear relationships. If the data form a curve or other non-linear pattern, Pearson’s r may not be appropriate.
If these conditions are not met (for example, the relationship is not linear or the variables are ordinal), the Spearman rank correlation is used instead.
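As a rough illustration of why the choice matters, the sketch below compares Pearson and Spearman on a hypothetical curved but strictly increasing relationship. Spearman is computed here as Pearson's r applied to ranks; the simple `ranks` helper assumes there are no tied values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n; assumes no ties (ties would need average ranks)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical non-linear but monotonic relationship: y = x cubed.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

pearson_value = pearson_r(x, y)
spearman_value = spearman_rho(x, y)
print("Pearson:", round(pearson_value, 3))
print("Spearman:", round(spearman_value, 3))
```

Because y always increases with x, the ranks agree perfectly and Spearman reaches 1, while Pearson stays below 1 because the points do not lie on a straight line.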
What is Regression?
Regression is a statistical method to model the relationship between a dependent variable and one or more independent variables.
- It is used to measure the influence of predictors and predict outcomes.
Example: Predicting a person’s salary based on education level, weekly working hours, and age.
- Dependent variable: the variable being predicted (e.g., salary).
- Independent variables: the variables used for prediction (e.g., education, working hours, age).
Types of Regression
Regression helps understand how the dependent variable changes when one unit of an independent variable changes, holding others constant.
| Type | Description |
|---|---|
| Simple Linear Regression | One independent variable |
| Multiple Linear Regression | Two or more independent variables |
| Logistic Regression | Predicts categorical outcomes (e.g., yes/no decisions) |
Simple vs. Multiple Regression
- Simple Regression: Use when one independent variable (IV) predicts the dependent variable (DV).
- Example: Does weekly working time (IV) affect a person’s income (DV)?
- Multiple Regression: Use when two or more independent variables predict the DV.
- Example: Do work hours (IV₁), age (IV₂), and education (IV₃) affect a person’s income (DV)?
Simple Linear Regression
- Predicts the value of a dependent variable (DV) based on one independent variable (IV).
- The stronger the linear relationship between IV and DV, the more accurate the prediction.
- A higher proportion of explained variance in the DV leads to better prediction quality.
- A scatter plot can illustrate the relationship — when the relationship is strong, data points cluster closely along a straight line.
Method of Least Squares
Linear regression uses the Method of Least Squares to find the best-fit line:
\[\hat{y} = b \cdot x + a\]
| Symbol | Name | Meaning |
|---|---|---|
| \(\hat{y}\) | Estimated dependent variable | Predicted \(y\)-value for each \(x\)-value |
| \(x\) | Independent variable | The predictor input |
| \(b\) | Slope | How much \(y\) changes when \(x\) increases by one unit |
| \(a\) | Intercept | Where the line crosses the \(y\)-axis (value of \(\hat{y}\) when \(x = 0\)) |
Concept of Residual
The residual (error) is the difference between the actual and predicted \(y\)-values:
\[e = y - \hat{y}\]
- Goal: Minimize the sum of squared residuals (Ordinary Least Squares — OLS).
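A minimal sketch of the least-squares fit for one predictor is shown below. It uses the closed-form solution (slope = sum of cross-products divided by sum of squares of x) and hypothetical data on weekly working hours and monthly income; `fit_line` is an illustrative name.

```python
def fit_line(x, y):
    """Return intercept a and slope b of the least-squares line y_hat = b * x + a."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    return a, b

# Hypothetical data: weekly working hours and monthly income.
hours = [20, 30, 40, 50, 60]
income = [1100, 1500, 2100, 2400, 2900]

a, b = fit_line(hours, income)
predicted = [b * xi + a for xi in hours]
residuals = [yi - pi for yi, pi in zip(income, predicted)]
sum_squared_residuals = sum(e ** 2 for e in residuals)

print("intercept a:", a)
print("slope b:", b)
print("residuals:", residuals)
print("sum of squared residuals:", sum_squared_residuals)
```

Any other choice of a and b would give a larger sum of squared residuals for these data, which is what "best-fit" means under ordinary least squares.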
Interpretation of Slope (\(b\))
- \(b > 0\): Positive relationship (as \(x\) increases, \(y\) increases)
- \(b < 0\): Negative relationship (as \(x\) increases, \(y\) decreases)
- \(b = 0\): No linear relationship between \(x\) and \(y\)
Multiple Linear Regression
In multiple linear regression, more than one independent variable is used to predict a single dependent variable. It allows for more accurate and complex prediction by considering multiple influencing factors.
Example
- Investigating cholesterol levels in patients:
- Independent variables: age, hours of exercise per week, dietary habits, etc.
- Dependent variable: cholesterol level.
Interpretation
An increase in an independent variable \(x_i\) by one unit will change the dependent variable \(y\) by \(b_i\) units, holding all other variables constant.
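Written out in the same notation as the simple regression line, the multiple regression model with \(k\) predictors can be sketched as follows (the index \(k\) and the variable names in the cholesterol example are illustrative assumptions):

\[\hat{y} = b_1 x_1 + b_2 x_2 + \cdots + b_k x_k + a\]

\[\widehat{\text{Cholesterol}} = b_1 \cdot \text{Age} + b_2 \cdot \text{Exercise hours} + b_3 \cdot \text{Diet score} + a\]

Here \(b_2\), for instance, would be the expected change in cholesterol for one additional hour of exercise per week among patients of the same age and diet.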
Coefficient of Determination (R²)
R² (also known as variance explained) shows how much of the variance in the dependent variable can be explained by the independent variables in the model.
\[R^2 = \frac{S^2_{\hat{y}}}{S^2_{y}} = \frac{\text{Variance of the Predicted values}}{\text{Variance of the Observed values}}\]
- \(R^2 = 1\): Perfect fit — all variance explained.
- \(R^2 = 0\): No variance explained — independent variables do not help predict the dependent variable.
The higher the \(R^2\), the better the regression model fits the data.
Adjusted R²
R² increases as more independent variables are added to the model, even if those variables don’t contribute meaningfully. Adjusted R² compensates for this by penalising for the number of predictors.
\[R^2_{adj} = 1 - \left(1 - R^2\right)\cdot\frac{n-1}{n-p-1}\]
| Term | Name | Role |
|---|---|---|
| \((1 - R^2)\) | Unexplained variance | Fraction of variance in the outcome not captured by the model |
| \((n - 1)\) | Total degrees of freedom | Anchors the baseline variability of the dataset |
| \((n - p - 1)\) | Residual degrees of freedom | Shrinks as \(p\) grows; this is the penalty term |
Why the penalty works
When a new predictor is added, \((n - p - 1)\) decreases. Two scenarios arise:
- Useful predictor: \((1 - R^2)\) drops meaningfully, offsetting the smaller denominator \(\Rightarrow\) \(R^2_{adj}\) increases or stays stable.
- Useless predictor: \((1 - R^2)\) barely changes, but the ratio \(\frac{n-1}{n-p-1}\) grows \(\Rightarrow\) \(R^2_{adj}\) falls, signalling the predictor was not worth including.
Key properties
- \(R^2_{adj} \leq R^2\) always, with equality only when \(p = 0\) or \(R^2 = 1\).
- \(R^2_{adj}\) can be negative when \(R^2\) is very small relative to the number of predictors, specifically when \(R^2 < \frac{p}{n-1}\).
- As \(n \to \infty\), the penalty vanishes and \(R^2_{adj} \approx R^2\), because large samples make overfitting less of a concern.
- Use \(R^2_{adj}\) when comparing models with different numbers of predictors — it provides a fair, penalised basis for comparison.
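The penalty can be checked numerically with a short sketch. The first call uses the values from the body-weight example in this training (\(R^2 = 0.754\), \(n = 10\), \(p = 3\)); the second call is a hypothetical scenario in which a fourth, nearly useless predictor raises \(R^2\) only to 0.756.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values from the body-weight example: R^2 = 0.754, n = 10, p = 3.
adj_three_predictors = adjusted_r2(0.754, n=10, p=3)

# Hypothetical: a fourth predictor that barely improves R^2.
adj_four_predictors = adjusted_r2(0.756, n=10, p=4)

print(round(adj_three_predictors, 3))
print(round(adj_four_predictors, 3))
```

Although the raw \(R^2\) rises slightly in the second case, the adjusted value drops, which is the signal that the extra predictor was not worth including.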
Assumptions of Linear Regression
For the results of regression analysis to be valid, the following assumptions must be met:
| Assumption | Description |
|---|---|
| Linearity | There must be a linear relationship between the dependent and independent variables |
| Homoscedasticity | The variance of residuals must be constant across all levels of the independent variable(s) |
| Normality | The errors (residuals) must be normally distributed |
| No Multicollinearity | Independent variables should not be highly correlated with each other |
| No Autocorrelation | Residuals should not show patterns or correlations across observations |
Linearity
Linear regression assumes a linear relationship between the dependent and independent variables. The goal is to draw a straight line that best represents the data points.
![]()
- Left Graph: A linear relationship is visible - data points align closely to the straight line, meaning a regression model will work effectively.
- Right Graph: A non-linear relationship is visible - a straight line cannot accurately represent the data, which may lead to incorrect predictions and conclusions.
Consequences of Non-Linearity
- Non-linearity can produce invalid regression coefficients and misleading predictions, leading to substantial errors and poor decision-making.
Homoscedasticity
In a regression model, there is always error (residuals) in predicting the dependent variable. Homoscedasticity means that the variance of the residuals is constant across all predicted values.
To check for homoscedasticity, plot the predicted values on the \(x\)-axis against the residuals (errors) on the \(y\)-axis.
- If homoscedasticity exists → residuals scatter evenly across all predicted values.
- If heteroscedasticity exists → the spread of the residuals changes across the range of predicted values (for example, a funnel shape).
Heteroscedasticity leaves the coefficient estimates unbiased but makes their standard errors unreliable, so p-values and confidence intervals may lead to incorrect conclusions.
Example
| Weight (kg) | Height (m) | Age (years) | Gender |
|---|---|---|---|
| 79 | 1.80 | 35 | Male |
| 69 | 1.68 | 39 | Male |
| 73 | 1.82 | 25 | Male |
| 95 | 1.70 | 60 | Male |
| 82 | 1.87 | 27 | Male |
| 55 | 1.55 | 18 | Female |
| 69 | 1.50 | 89 | Female |
| 71 | 1.78 | 42 | Female |
| 64 | 1.67 | 16 | Female |
| 69 | 1.64 | 52 | Female |
- The aim is to predict body weight.
- The dependent variable is body weight.
- The independent variables are body height, age, and gender.
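One way to fit this model without any external library is to solve the normal equations \((X^\top X)\,b = X^\top y\) directly. The sketch below does that for the ten rows of the example table, coding gender as `is_male` (1 = male, 0 = female). The `solve` helper and the use of normal equations are choices made for this illustration, not necessarily how the reported results were produced; statistical software normally uses more stable decompositions.

```python
# Rows from the example table: (weight kg, height m, age years, is_male).
rows = [
    (79, 1.80, 35, 1), (69, 1.68, 39, 1), (73, 1.82, 25, 1),
    (95, 1.70, 60, 1), (82, 1.87, 27, 1), (55, 1.55, 18, 0),
    (69, 1.50, 89, 0), (71, 1.78, 42, 0), (64, 1.67, 16, 0),
    (69, 1.64, 52, 0),
]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

y = [float(r[0]) for r in rows]
X = [[1.0, r[1], float(r[2]), float(r[3])] for r in rows]  # intercept column first
k = len(X[0])

# Normal equations: (X'X) coef = X'y
XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
coef = solve(XtX, Xty)  # [intercept, height, age, is_male]

predicted = [sum(c * v for c, v in zip(coef, row)) for row in X]
residuals = [yi - pi for yi, pi in zip(y, predicted)]
mean_y = sum(y) / len(y)
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot
n, p = len(y), k - 1
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print("intercept, height, age, is_male:", [round(c, 3) for c in coef])
print("R^2:", round(r2, 3), "adjusted R^2:", round(r2_adj, 3))
```

The printed coefficients and \(R^2\) should land close to the regression equation and model summary reported in the results for this example.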
Results
Interpretation of Results
Model Summary
- \(R^2 = 75.4\%\) → 75.4% of the variation in weight is explained by the independent variables (height, age, and gender).
- \(R^2_{adj} = 63\%\) → After adjusting for the number of predictors and degrees of freedom, about 63% of the variance is explained by the model. Adjusted R² is a more conservative measure of model fit because it penalises the addition of variables that don’t improve the model.
- Average prediction error (residual standard error) = 6.587 kg.
Regression Equation
\[\text{Weight} = 47.379 \times \text{Height} + 0.297 \times \text{Age} + 8.922 \times \text{is\_male} - 24.41\]
Interpreting Coefficients
- Age: Each additional year of age is associated with a 0.297 kg higher predicted weight (holding other variables constant).
- Gender (is_male): Males are predicted to weigh 8.922 kg more than females of the same height and age.
Hypothesis Testing
- \(H_0\): Coefficient \(= 0\) (no effect)
- \(H_1\): Coefficient \(\neq 0\) (has effect)
In this model, only Age has \(p < 0.05\), making it the only statistically significant predictor.
What is Survival Analysis?
- Survival analysis is a statistical method used to examine the time until a specific event occurs, such as: Death, Disease onset, Relapse, Recovery, Equipment failure
- It focuses on time-based variables, measuring the duration between a start event and an end event.
- Time is typically recorded in days, weeks, or months.
Key Components
- Start time: The point when observation begins (e.g., diagnosis date).
- Event: The point when the outcome occurs (e.g., death, failure).
- Survival time: The time between the start and the occurrence of the event.
- Time from end of withdrawal treatment to relapse: Start = End of withdrawal process, Event = Relapse, Survival time = Number of days or weeks until relapse
- Time from disease diagnosis to death: Start = Diagnosis date, Event = Death
What is Censoring?
- Censoring occurs when:
- The event of interest has not happened by the study’s end.
- The subject leaves the study before the event occurs.
Censoring (continued)
- Ignoring censoring leads to biased results and incorrect survival estimates.
- Censoring is a natural and common part of survival studies and a critical feature that distinguishes survival analysis from other types of statistical modeling.
- Imagine you are a dental technician analyzing the lifespan of tooth fillings:
- Start time: Day the filling is placed.
- Event: Filling breaks or falls out.
- You record the survival time for each patient’s filling. However, two important scenarios arise:
- A patient’s filling has not failed by the end of the study.
- A patient drops out (moves away, changes dentist) before the filling fails.
- These situations introduce censoring.
Basic Concepts of Survival analysis
- Survival time analysis uses specialized statistical methods designed to handle time-to-event data and account for censoring.
The three most common methods are:
- Kaplan-Meier Survival Curves
- Log-Rank Test
- Cox Proportional Hazards Regression
Kaplan-Meier Curve
The Kaplan-Meier curve is one of the most widely used methods for estimating survival functions.
It answers a key question: What is the probability that the event of interest has not occurred by a certain time point?
Kaplan-Meier Plot Axes: X-axis: Time (e.g., days, weeks, months), Y-axis: Probability of survival (from 1 or 100% down to 0)
The Kaplan-Meier curve would display the probability that a filling remains intact over time.
With the curve, you can answer questions like:
- What proportion of fillings last at least 5 years?
- How rapidly does the failure rate increase after insertion?
- At what point have 50% of the fillings failed (the median survival time)?
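A minimal sketch of the Kaplan-Meier (product-limit) estimator is given below, using hypothetical filling lifetimes in years. An event flag of 1 means the filling failed; 0 means the observation was censored. At each failure time the survival probability is multiplied by (number at risk minus failures) / (number at risk); censored fillings simply leave the risk set afterwards.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            leaving += 1
            i += 1
        if failures > 0:
            survival *= (at_risk - failures) / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # failed and censored subjects leave the risk set
    return curve

# Hypothetical data: years until failure (1) or censoring (0) for five fillings.
times = [2, 3, 4, 5, 6]
events = [1, 0, 1, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"year {t}: S(t) = {s:.3f}")

# Median survival time: first time at which S(t) drops to 0.5 or below.
median = next((t for t, s in curve if s <= 0.5), None)
print("median survival time:", median)
```

Note that the filling censored at year 3 still counts in the risk set at year 2, which is how the method uses partial information instead of discarding censored cases.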
Comparing different groups
- When studying survival time, researchers often want to compare two or more groups. For example:
- Comparing patients receiving two different treatments.
- Comparing male vs. female patients.
- Comparing different age groups or exposure levels.
- In such cases, the Kaplan-Meier curve is drawn separately for each group. Each line on the plot represents the estimated survival rate over time for a particular group.
Comparing different groups (continued)
Visual comparison of the survival curves can suggest differences, for example, one group may show faster failure rates than another.
- However, visual inspection alone is not enough.
We need a formal statistical test, the Log-Rank Test, to check whether these differences are statistically significant.
The Log-Rank Test is a statistical test used to compare survival distributions between two or more independent groups.
It answers the question: Is there a statistically significant difference in the time-to-event between the groups?
The Log-Rank test is based on comparing the observed number of events in each group with the expected number of events under the assumption that the groups have the same survival experience.
Hypotheses in the Log-Rank Test
- Null Hypothesis (H₀): The survival distributions of the groups are identical. (There is no difference in survival times between groups.)
- Alternative Hypothesis (H₁): The survival distributions of the groups are different. (There is a difference in survival times between groups.)
Result
Interpreting the Log-Rank Test
- If the p-value is small (typically < 0.05):
- Reject the null hypothesis.
- Conclude that there is a statistically significant difference between the groups’ survival experiences.
- If the p-value is large (typically ≥ 0.05):
- Fail to reject the null hypothesis.
- Conclude that there is no statistically significant evidence of a difference between the groups.
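The observed-versus-expected logic can be sketched for two groups as follows. At every event time, the expected number of events in group A is the total number of events multiplied by group A's share of the subjects still at risk. The resulting statistic is compared with a chi-square distribution with 1 degree of freedom (critical value 3.841 at the 5% level). The data are hypothetical lifetimes of fillings made from two materials, and `log_rank_test` is an illustrative name.

```python
import math

def log_rank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)   # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)   # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))

    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical data: years to failure (1) or censoring (0) for two materials.
material_a = ([2, 3, 4, 5, 6], [1, 0, 1, 1, 0])
material_b = ([4, 6, 7, 8, 9], [1, 1, 0, 1, 0])

chi2, p_value = log_rank_test(*material_a, *material_b)
print("chi-square:", round(chi2, 3), "p-value:", round(p_value, 3))
print("reject H0 at 5%:", chi2 > 3.841)
```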
Cox Regression
- What is Cox Regression?
- Cox Regression (also called the Cox Proportional Hazards Model) is a method used in survival analysis to:
- Examine the influence of several variables (continuous, binary, or categorical) on survival time.
- Adjust for multiple variables at the same time.
- Predict how changes in the variables affect the risk (hazard) of the event occurring.
- In practice, it allows us to:
- Determine the effects of multiple independent variables on a time-to-event outcome.
- Test hypotheses about which factors affect survival.
- Build predictive models based on those factors.
Cox Regression Model (Cox Proportional Hazards Model)
The Cox model describes the hazard (risk) of an event over time:
\[H(t) = H_0(t) \times \exp\left(b_1 x_1 + b_2 x_2 + \cdots + b_k x_k\right)\]
- \(H_0(t)\): Baseline hazard when all predictors are zero.
- \(x_1,\ldots,x_k\): Predictor variables, \(b_1,\ldots,b_k\): Regression coefficients.
- The exponential of a coefficient exp(b) gives the Hazard Ratio (HR):
- For binary variables (e.g., exposed vs. unexposed), HR shows how much more (or less) likely the event is in the exposed group.
- For continuous variables (e.g., age), HR shows the change in hazard per one-unit increase:
- Example: HR = 1.03 for age → each year increases the hazard by 3%.
- Example: HR = 0.85 for albumin → each 1 g/dL increase reduces the hazard by 15%.
- Interpretation assumes other covariates are held constant.
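The link between a coefficient and its hazard ratio is simply HR = exp(b). The sketch below reproduces the two examples above and adds one assumption-based extension: under the model, a difference of several units multiplies the hazard by exp(b × units). The function names are illustrative.

```python
import math

def hazard_ratio(coefficient, units=1.0):
    """Hazard ratio for a change of `units` in a predictor: exp(b * units)."""
    return math.exp(coefficient * units)

def percent_change(hr):
    """Percentage change in hazard implied by a hazard ratio."""
    return (hr - 1.0) * 100.0

# Coefficient that corresponds to HR = 1.03 per year of age.
b_age = math.log(1.03)

hr_one_year = hazard_ratio(b_age)
hr_ten_years = hazard_ratio(b_age, units=10)

print("HR per year:", round(hr_one_year, 3))
print("HR per 10 years:", round(hr_ten_years, 3))
print("albumin, HR = 0.85:", round(percent_change(0.85), 1), "% change in hazard")
```

The ten-year hazard ratio is 1.03 raised to the tenth power (about 1.34), not 1 + 10 × 0.03, because effects in the Cox model are multiplicative.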
Cox Regression: data
- Let’s assume that we have the following data and we want to evaluate them.
Cox Regression: results
- The following Cox regression results were generated from the data above.
Interpretation
- Coefficient (β): Reflects the direction and strength of the variable’s association with survival.
- Negative β → Lower risk (longer survival).
- Positive β → Higher risk (shorter survival).
- P-value: Tests whether the coefficient is significantly different from zero.
Hypothesis Testing in Cox Regression
- Null Hypothesis (H₀): The coefficient is zero (no effect on survival).
- Alternative Hypothesis (H₁): The coefficient is not zero (there is an effect on survival).
- Decision rule:
- If p-value < 0.05 → Reject H₀ → Variable has a significant effect.
- If p-value ≥ 0.05 → Fail to reject H₀ → No statistically significant evidence of an effect.
BASIC CONCEPTS OF LONGITUDINAL ANALYSIS
Basic concepts of Longitudinal Data analysis
- In survey research and data collection, two major methods are:
- Longitudinal studies and Cross-sectional studies
- Both are widely used across health, social science, and humanitarian fields.
- Knowing their differences is crucial for designing effective research and managing data collection workflows
Key Differences
| Aspect | Longitudinal study | Cross-sectional study |
|---|---|---|
| Timing | Over multiple time points | At a single point in time |
| Objective | Track changes or trends | Describe current status |
| Subjects | Same individuals followed | Different individuals sampled |
| Strengths | Establishes temporal order, supporting potential causal inference | Quick, cost-effective |
When to Use a Longitudinal Study
- When the research question involves changes, trends, or trajectories over time.
- When exploring the effect of time-dependent exposures.
- When within-subject comparisons are critical.
- When aiming to establish temporal sequences for potential causal inferences.
Example: Tracking changes in maternal nutritional status from early pregnancy to postpartum to examine impacts on birth outcomes.
Key Features of Longitudinal Data
- Statistical techniques like ANOVA and regression usually assume independent and identically distributed (iid) residuals.
- In practice, data often violate this assumption due to correlations among observations.
- Correlated data structures include:
- Clustered data, Repeated measurements, Spatially correlated data
Examples for clustered data: Families, Schools, Hospitals, Towns.
- When repeated measurements are collected over time, the resulting data form a longitudinal (or panel) study.
- This design allows researchers to examine how individuals change over time and what factors influence those changes.
- Because repeated measurements on the same individual are correlated, such data violate the iid assumption.
Objectives of Longitudinal Studies
- Characterize how a response changes over time.
- Identify factors that influence these changes.
- Key Advantages:
- Distinguish within-individual changes
- (e.g., how a person’s test score evolves over time).
- Separate between-individual differences
- (e.g., how people differ overall).
Exploratory Data Analysis (EDA) in Longitudinal Studies
- What is Exploratory Data Analysis (EDA)?
- Exploratory analysis comprises techniques to visualize patterns in the data.
- Data analysis must begin by making displays that expose patterns relevant to the scientific question.
- EDA helps to:
- Uncover expected and unexpected patterns.
- Highlight relationships and trends through graphical displays.
- Indicate which model is appropriate for the analysis.
Example: Jimma Infant Survival Data
A follow-up study of newborn infants in Southwest Ethiopia with measurements at 7 time points (every 2 months, 0-12 months).
| Variable | Description | Type |
|---|---|---|
| ind | Infant ID | Numeric (ID) |
| sex | Sex of the infant | Categorical |
| place | Place of residence | Categorical (1=Urban, 2=Rural) |
| weight | Weight (grams) | Numeric |
| length | Length/height (cm) | Numeric |
| Bf | Breastfeeding status | Binary (1=Yes, 0=No) |
| age | Age (months: 0,2,4,6,8,10,12) | Numeric |
| BMIBIN | BMI category | Binary (1=Normal, 0=Other) |
Study Overview
- Follow-up period: 12 months, with seven time points per child
- Measurements taken every two months
- Weight recorded at each visit
- Research Question: How does weight change over time?
Individual vs. Mean Profiles
Conclusions from the profile:
- Much variability between children
- Fixed number of measurements per subject
- Considerable variability within subjects
- Measurements taken at fixed time points
Exploratory analysis conclusion
Conclusion - From the exploratory analysis:
- Mean structure seems linear over time.
- Variability between subjects at baseline.
- Variability between subjects in the way they evolve.
- Hence, a linear mean with a random intercept and a random slope is a reasonable choice.
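Written out, a model with a linear mean, a random intercept, and a random slope could look like the sketch below, where \(i\) indexes infants, \(j\) indexes visits, and \(t_{ij}\) is the age in months at visit \(j\). This indexing and the symbols \(\beta_0, \beta_1, b_{0i}, b_{1i}\) are assumptions introduced here for illustration:

\[\text{Weight}_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\, t_{ij} + \varepsilon_{ij}\]

Here \(\beta_0\) and \(\beta_1\) describe the average starting weight and average growth per month, while \(b_{0i}\) and \(b_{1i}\) are infant \(i\)'s deviations from those averages, capturing between-subject variability at baseline and in the way subjects evolve.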
Exploring the random effects
- Choosing Random Effects:
- Decide which parameters need random effects to account for between-group variation.
- Covariance Structure:
- The pairs method helps explore the covariance structure among random effects.
Linear Mixed Model
- Correctly modeling correlation is essential for valid inference about regression coefficients.
- Linear mixed models (LMMs) extend general linear models to account for correlated errors.
- They combine:
- Fixed effects (e.g., sex, age group)
- Random effects (e.g., individual subjects)
- LMMs make assumptions about:
- Mean structure: linear or nonlinear
- Variance function: constant or changing (e.g., quadratic)
- Correlation structure: independent, serial, etc.
- Subject-specific profiles: linear, quadratic, etc.
The model is given by
\[Y_i = \underbrace{X_i \beta}_{\text{fixed effect}} + \underbrace{Z_i b_i}_{\text{random effect}} + \varepsilon_i\]
Where:
\[b_i \sim N(0, D), \quad \varepsilon_i \sim N(0, \Sigma_i)\]
\[b_1, b_2, \ldots, b_N,\ \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N \ \text{are independent}\]
| Symbol | Meaning |
|---|---|
| \(\beta\) | Fixed effects (population-level) |
| \(b_i\) | Random effects (individual-level) |
| \(D\) | Covariance of random effects |
| \(\Sigma_i\) | Residual covariance |
LMM result for Jimma infant data
| Characteristic | Estimate | CI (lower, upper) | p-value |
|---|---|---|---|
| Age | 433 | 389, 477 | < 0.001 |
| Sex - Female (ref: Male) | 282 | −273, 837 | 0.3 |
| Breastfed (ref: Not breastfed) | 244 | −314, 802 | 0.4 |
| Random Effect | | | |
| Random intercept | 345.302 | | |
| Random slope of time | 54.087 | | |
| Residual | 656.924 | | |
| ICC | 34.5% | | |
Age is the only highly significant predictor: each additional month increases weight by approximately 433 grams on average, adjusting for sex and breastfeeding status.
Intraclass Correlation Coefficient (ICC)
The ICC quantifies what proportion of the total variability is attributable to between-individual differences.
\[ICC = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}} = \frac{345.302}{345.302 + 656.924} \approx 0.345\]
| Source of variability | Share of total |
|---|---|
| Between-individual differences | 34.5% |
| Within-individual changes over time | 65.5% |
- An ICC of 34.5% indicates that about one third of the total variability in weight is due to differences between infants, with the remainder due to variation within infants over time.
- This suggests that ignoring the random effects would lead to invalid standard errors and misleading inference.
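The ICC arithmetic can be reproduced with a short sketch. It takes the random-intercept and residual values from the LMM result table and, following the formula above, treats them as the between- and within-individual variance components; `icc` is an illustrative function name.

```python
def icc(var_between, var_within):
    """Intraclass correlation: share of total variance that is between individuals."""
    return var_between / (var_between + var_within)

# Values from the LMM result table for the Jimma infant data.
random_intercept = 345.302
residual = 656.924

icc_value = icc(random_intercept, residual)
print("ICC:", round(icc_value, 3))
print("within-individual share:", round(1 - icc_value, 3))
```

An ICC near 0 would mean repeated measurements on the same infant are no more alike than measurements on different infants, while an ICC near 1 would mean almost all variability is between infants.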
Thank you!
Questions & discussion are welcome
Descriptive statistics
Hypothesis testing
Regression
Survival analysis
Longitudinal data